1. INTRODUCTION

Finding suitable hyperparameters for a convolutional neural network (CNN) remains a challenge for
researchers. CNNs are widely used in computer vision, especially image recognition. In the recognition
process, a CNN has several advantages, such as automatically extracting important features from each image
and reducing memory use and complexity. The number of hyperparameters strongly influences CNN
performance. However, hyperparameter optimization is not easy because accuracy depends on the quality of
the hyperparameters. Furthermore, there is no standard rule for hyperparameter optimization because each
hyperparameter has its own characteristics and the search can still end in a local optimum. Improving
performance with a large number of hyperparameters is difficult, which is a known weakness of deep neural
networks [1]. Hyperparameter optimization is therefore a solution [2].

Several hyperparameter optimization techniques have been proposed. Grid search (GS) [3] and random search
(RS) [4] are popular optimization algorithms. However, neither algorithm has a learning mechanism that
steers the search toward an optimal solution, and each solution they produce is independent of the others.
Such a mechanism is fundamental to the search because it helps ensure that the solution found is global
rather than trapped in a local optimum of the high-dimensional search space.

Evolutionary algorithms (EA) provide this ability. EA, or neuroevolution, is a popular and widely used
optimization tool for directing the learning process [5], [6]. In previous research, neuroevolution has
been applied to optimizing global hyperparameters or a small number of hyperparameters. Most studies
related to neuroevolution use the genetic algorithm (GA) [7]-[11]. GA has two operators, crossover and
mutation, which are used for exploration and exploitation of the search space. This paper proposes the
optimization of CNN hyperparameters and architecture using GA (CNN-GA). CNN is a deep learning model
widely applied in various domains because of its high performance, for example in computer vision,
including gender classification [12], image classification [13], [14], vehicle tracking systems [15], and
e-detection [16].

Despite its capabilities, CNN still has some challenges. The selection of hyperparameters strongly
influences its performance. An empirical study on text mining data [17] shows that a larger number of
hyperparameters and the hyperparameter values strongly influence CNN performance. However, this claim has
not necessarily been proven on different data. Meanwhile, many papers on CNN hyperparameter optimization
using neuroevolution focus on image classification [9], [10], [18]-[23]. Therefore, we focus on CNN
optimization using GA for image recognition.

This paper makes two contributions. First, CNN-GA optimizes 20 hyperparameters for English handwritten
recognition. To our knowledge, no previous research has optimized all hyperparameters (global,
architecture, and layer); previous research focused only on global or layer hyperparameters. Max-norm
weight constraints in convolutional and dense layers also need to be optimized because their values
significantly impact the search process, and different values affect model performance. Second, this paper
designs seven experiments that differ in the number of hyperparameters and in the number of optimized
hyperparameters. According to [17], a larger number of hyperparameters tends to produce better model
performance.


2. METHOD 

The flow chart of the proposed method is shown in Figure 1. The first stage is data pre-processing.
This research used the English handwritten recognition (HR) dataset obtained from the National Institute of
Standards and Technology (NIST) [24]. The dataset contains 372,450 records. The details are shown in
Figure 2. The figure shows that the distribution of samples across the letters of the alphabet is
unbalanced, so the dataset is imbalanced. Therefore, this research used an under-sampling approach to
balance the dataset. The balanced dataset is shown in Figure 3. Next, the images are reshaped and converted
into grayscale, then split into training data and testing data with a proportion of 80:20.

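The balancing and splitting steps can be sketched as follows. This is a minimal illustration rather than
the code used in this study: the helper names `undersample` and `split_80_20`, the toy labels, and the
random seed are assumptions, and integers stand in for the images.

```python
import random
from collections import Counter, defaultdict

def undersample(samples, labels, rng):
    """Keep as many samples of every class as the rarest class has."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    n_keep = min(len(indices) for indices in by_class.values())
    keep = []
    for indices in by_class.values():
        keep.extend(rng.sample(indices, n_keep))  # random subset without replacement
    rng.shuffle(keep)
    return [samples[i] for i in keep], [labels[i] for i in keep]

def split_80_20(samples, labels, rng, train_fraction=0.8):
    """Shuffle once, then cut into training and testing parts."""
    order = list(range(len(labels)))
    rng.shuffle(order)
    cut = round(train_fraction * len(order))
    train, test = order[:cut], order[cut:]
    return ([samples[i] for i in train], [labels[i] for i in train],
            [samples[i] for i in test], [labels[i] for i in test])

rng = random.Random(0)                          # hypothetical seed
labels = ["A"] * 50 + ["B"] * 20 + ["C"] * 30   # toy imbalanced labels
samples = list(range(len(labels)))              # integers stand in for images
x_bal, y_bal = undersample(samples, labels, rng)
x_train, y_train, x_test, y_test = split_80_20(x_bal, y_bal, rng)
print(Counter(y_bal), len(x_train), len(x_test))
```

Under-sampling discards data from the larger classes, so every class ends up with the size of the
rarest one before the 80:20 split is applied.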

Figure 1. Workflow of the proposed GA-CNN algorithm 


We used a standard GA for hyperparameter optimization [25], which consists of five processes:
population initialization, evaluation, selection, crossover, and mutation. The population initialization
is similar to that of [17], except that we eliminated the embedding from this process. The population
initialization is shown in Figure 4. Regardless of its number of dense layers (ND) and convolutional
layers (NC), each individual has the same chromosome length, which is determined by the number of
hyperparameters optimized in each experiment. The list, range, and default values of the hyperparameters
are presented in Table 1. Following [17], we assume that the number of optimized hyperparameters affects
the model's performance. Therefore, we conducted seven experiments, as in [17]. This study generated genes
for all layer hyperparameters of each layer up to the maximum of NC and ND. This generation process was
carried out to facilitate crossover and mutation. As a result, the number of genes for the pool size (PS)
is at most NC - 1, because a max-pooling layer follows every convolution layer except the last.

Figure 3. The balanced A-Z English HR dataset

Figure 4. Population initialization of the CNN-GA

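The fixed-length encoding can be illustrated with a reduced chromosome. The sketch below is an assumption
about how such an encoding might look, not the implementation used in this study: it keeps only a few
hyperparameters from Table 1, reads PS as the pool size, and the names `random_individual` and `decode`
are hypothetical.

```python
import random

MAX_LAYERS = 15  # upper bound of NC and ND in Table 1

def random_individual(rng):
    """Sample one fixed-length chromosome as a flat list of genes."""
    genes = [
        rng.randint(1, 100),         # NE: number of epochs
        rng.randint(32, 256),        # BS: batch size
        rng.randint(1, MAX_LAYERS),  # NC: number of convolutional layers
        rng.randint(0, MAX_LAYERS),  # ND: number of dense layers
    ]
    # Layer genes are sampled for every possible layer, even those beyond NC or ND,
    # so that all chromosomes line up position by position.
    genes += [rng.randint(32, 512) for _ in range(MAX_LAYERS)]   # NF per conv layer
    genes += [rng.randint(2, 6) for _ in range(MAX_LAYERS - 1)]  # PS per pooling layer
    genes += [rng.randint(1, 5) for _ in range(MAX_LAYERS)]      # NN per dense layer
    return genes

def decode(genes):
    """Keep only the layer genes that the sampled NC and ND actually use."""
    ne, bs, nc, nd = genes[:4]
    nf = genes[4:4 + MAX_LAYERS]
    ps = genes[4 + MAX_LAYERS:3 + 2 * MAX_LAYERS]
    nn = genes[3 + 2 * MAX_LAYERS:]
    return {"NE": ne, "BS": bs, "NC": nc, "ND": nd,
            "NF": nf[:nc], "PS": ps[:nc - 1], "NN": nn[:nd]}

rng = random.Random(0)  # hypothetical seed
population = [random_individual(rng) for _ in range(25)]
print(len(population[0]), decode(population[0]))
```

Because unused layer genes stay in the chromosome, two parents with different NC and ND can still be
crossed gene by gene.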

After initialization, each individual was evaluated with the fitness function, which is derived from the
confusion matrix (accuracy, precision, F1-score, and recall). Figure 1 shows the evaluation process, in
which the architecture and the resulting hyperparameter values are transferred to the CNN. The CNN then
returns the fitness value. The higher the fitness value, the greater the chance of surviving and being
selected for the next generation.

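For reference, the four scores can be derived from a confusion matrix as sketched below. Macro averaging
over classes is an assumption, since the averaging scheme is not stated here, and the function names and
toy labels are illustrative only.

```python
def confusion_matrix(y_true, y_pred, n_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * n_classes for _ in range(n_classes)]
    for truth, prediction in zip(y_true, y_pred):
        matrix[truth][prediction] += 1
    return matrix

def scores(matrix):
    """Accuracy plus macro-averaged precision, recall, and F1-score."""
    n = len(matrix)
    total = sum(sum(row) for row in matrix)
    accuracy = sum(matrix[i][i] for i in range(n)) / total
    precisions, recalls, f1s = [], [], []
    for i in range(n):
        tp = matrix[i][i]
        predicted = sum(matrix[r][i] for r in range(n))  # column sum
        actual = sum(matrix[i])                          # row sum
        p = tp / predicted if predicted else 0.0
        r = tp / actual if actual else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(2 * p * r / (p + r) if p + r else 0.0)
    return {"accuracy": accuracy,
            "precision": sum(precisions) / n,
            "recall": sum(recalls) / n,
            "f1": sum(f1s) / n}

y_true = [0, 0, 1, 1, 2, 2]  # toy labels, not data from this study
y_pred = [0, 1, 1, 1, 2, 0]
result = scores(confusion_matrix(y_true, y_pred, n_classes=3))
print(result)
```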
Furthermore, the proposed model used three operators: crossover, mutation, and selection. This
research used uniform mutation and randomly selected one of three crossover types (one-point, two-point,
and uniform). For NC and ND, architectures were mated using a one-point crossover operator with the global
max-pooling layer. These layers serve as intersection points, at which dense layers are mutated or
convolution layers are added or removed. The best individual of each generation was then carried over
through elitism.

Table 1. Hyperparameter range and value

| Group | Hyperparameter | Description | Range/Values |
|---|---|---|---|
| Global | NE | Number of epochs | min: 1, max: 100, default: 10 |
| Global | BS | Batch size | min: 32, max: 256, default: 32 |
| Global | OP | Optimizer | ['Nadam', 'Adagrad', 'Adadelta', 'Sgd', 'Rmsprop', 'Adam', 'Adamax'], default: 'Adam' |
| Global | LR | Learning rate | min: 1e-4, max: 1e-2, default: 1e-4 |
| Global | MO | Momentum | min: 0.0, max: 1.0, default: 0.9 |
| Layer | NF | Number of filters (convolution) | min: 32, max: 512, default: 64 |
| Layer | KS | Kernel size (convolution) | min: 1, max: 5, default: 3 |
| Layer | AFC | Activation function (convolution) | ['RL', 'Sf', 'El', 'SL', 'SF', 'SG', 'Tn', 'Sid', 'HsG', 'Linear'], default: 'RL' |
| Layer | KIC | Kernel initializer (convolution) | ['Zeros', 'Glorot_Uniform', 'Ones', 'Uniform', 'Normal', 'Glorot_Normal', 'Lecun_Normal', 'Lecun_Uniform', 'He_Normal', 'He_Uniform'], default: 'Glorot_Uniform' |
| Layer | WCC | Max-norm weight constraint (convolution) | min: 1, max: 5, default: 3 |
| Layer | NN | Number of neurons (dense) | min: 1, max: 5, default: 1 |
| Layer | AFD | Activation function (dense) | ['RL', 'Sf', 'El', 'SL', 'SF', 'SG', 'Tn', 'Sid', 'HsG', 'Linear'], default: 'RL' |
| Layer | KID | Kernel initializer (dense) | ['Zeros', 'Ones', 'Uniform', 'Lecun_Normal', 'Lecun_Uniform', 'Normal', 'Glorot_Normal', 'Glorot_Uniform', 'He_Normal', 'He_Uniform'], default: 'Glorot_Uniform' |
| Layer | WCD | Max-norm weight constraint (dense) | min: 1, max: 5, default: 3 |
| Layer | DR | Drop rate (dropout) | min: 0.0, max: 1.0, default: 0.2 |
| Layer | PS | Pool size (max-pooling) | min: 2, max: 6, default: 5 |
| Layer | KIO | Kernel initializer (output) | ['Zeros', 'Glorot_Normal', 'Ones', 'Uniform', 'Normal', 'Glorot_Uniform', 'He_Normal', 'He_Uniform'], default: 'Glorot_Uniform' |
| Architecture | NC | Convolutional layers number | min: 1, max: 15, default: 1 |
| Architecture | ND | Dense layers number | min: 0, max: 15, default: 1 |

Note: Sf: Softmax, El: Elu, SL: Selu, SF: Softplus, SG: Softsign, Tn: Tanh, Sid: Sigmoid, HsG: Hard_Sigmoid,
RL: Relu

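The variation operators named before Table 1 can be sketched in plain Python as follows. This is an
illustrative version under stated assumptions, not the DEAP-based code of this study: the per-gene
probabilities `swap_prob` and `gene_prob` are hypothetical, and only integer genes with ranges taken from
Table 1 are shown.

```python
import random

def one_point(a, b, rng):
    """Swap everything after a single cut point."""
    cut = rng.randint(1, len(a) - 1)
    return a[:cut] + b[cut:], b[:cut] + a[cut:]

def two_point(a, b, rng):
    """Swap the segment between two distinct cut points."""
    i, j = sorted(rng.sample(range(1, len(a)), 2))
    return a[:i] + b[i:j] + a[j:], b[:i] + a[i:j] + b[j:]

def uniform(a, b, rng, swap_prob=0.5):
    """Swap each gene independently with probability swap_prob."""
    c, d = list(a), list(b)
    for k in range(len(a)):
        if rng.random() < swap_prob:
            c[k], d[k] = d[k], c[k]
    return c, d

def crossover(a, b, rng):
    """Pick one of the three crossover types at random."""
    operator = rng.choice([one_point, two_point, uniform])
    return operator(a, b, rng)

def uniform_mutation(individual, bounds, rng, gene_prob=0.1):
    """Redraw a gene uniformly within its own range with probability gene_prob."""
    return [rng.randint(low, high) if rng.random() < gene_prob else gene
            for gene, (low, high) in zip(individual, bounds)]

# Integer ranges from Table 1: NE, BS, NF, KS, PS.
BOUNDS = [(1, 100), (32, 256), (32, 512), (1, 5), (2, 6)]
rng = random.Random(0)  # hypothetical seed
parent_a = [rng.randint(low, high) for low, high in BOUNDS]
parent_b = [rng.randint(low, high) for low, high in BOUNDS]
child_a, child_b = crossover(parent_a, parent_b, rng)
mutant = uniform_mutation(child_a, BOUNDS, rng)
print(child_a, child_b, mutant)
```

All three crossover types only exchange genes at the same position, so every child gene stays inside the
range of its hyperparameter; uniform mutation preserves this by redrawing within the same bounds.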

3. RESULT AND DISCUSSION 

This section discusses the performance of CNN+GA in obtaining a near-optimum combination of CNN
hyperparameters and architecture. The English HR dataset [24] was used to assess the performance of the
model. In this study, the dataset was split into 80% for training and 20% for testing. The training
dataset was further split into training and validation datasets with an 80:20 ratio [26]. English HR is an
image recognition task that classifies the letters of the alphabet into 26 classes, namely A-Z or a-z. In
this experiment, the optimization method used the training and validation data to produce the near-optimum
architecture and combination of hyperparameters. Python was used to implement all hyperparameter
optimizations. Scikit-learn [3] was used to calculate the confusion matrix, and Matplotlib [27] and Seaborn
were used for visualization. Pandas [28] was used to process the datasets, and NumPy [29] handled the
scientific computing. TensorFlow [30] and Keras were used to build the CNN model. Distributed evolutionary
algorithms in Python (DEAP) was used to implement the GA. Finally, Google Colab Pro+ with a GPU and
high-RAM runtime was used to perform all the experiments in Table 2.

Whether a layer hyperparameter uses the same or different values across layers is marked with
asterisks. Following [17], we performed seven experiments that differ in the number of hyperparameters and
in the number of optimized hyperparameters, as presented in Table 2. In [17], the number of optimized
hyperparameters depends on whether each layer has the same or different values. One star (*) on a layer
hyperparameter means that the hyperparameter is selected once and is the same for all layers, whereas two
stars (**) mean that each layer has a different hyperparameter value. Experiment pairs 2-3, 4-5, and 6-7
have the same number of hyperparameters, but the number of optimized hyperparameters differs.

Determining the GA parameter values, namely the crossover rate (CR) and mutation rate (MR), is
essential in the hyperparameter optimization process. This paper sets CR = 0.8 and MR = 0.2. These values
are based on [17], which refers to several papers [8], [10]. The number of generations (Ngen) and the
population size (Npop) selected in this experiment is 25. Generally, a large population size and a large
number of generations result in better performance but take longer to run. Previous research shows that
small sizes have also been widely used and produce good performance [8]-[10], [17], [31].

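To show how these settings interact, the following is a minimal generational loop with CR = 0.8,
MR = 0.2, and elitism. It is a sketch under several assumptions and not the DEAP code used in this study:
the fitness function is a toy stand-in for validation accuracy, parents are chosen by fitness-proportional
selection, only one-point crossover is used for brevity, and both Npop and Ngen are taken to be 25.

```python
import random

CR, MR = 0.8, 0.2      # crossover and mutation rates from the text
N_POP, N_GEN = 25, 25  # assumed: population size and number of generations

def toy_fitness(individual):
    """Stand-in for validation accuracy: share of genes equal to 1."""
    return sum(individual) / len(individual)

def evolve(fitness, n_genes=20, seed=0):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_genes)] for _ in range(N_POP)]
    best_per_generation = []
    for _ in range(N_GEN):
        scores = [fitness(individual) for individual in population]
        best_per_generation.append(max(scores))
        elite = population[scores.index(max(scores))]
        weights = [score + 1e-9 for score in scores]  # offset avoids all-zero weights
        children = [list(elite)]                      # elitism: the best individual survives
        while len(children) < N_POP:
            a, b = (list(p) for p in rng.choices(population, weights=weights, k=2))
            if rng.random() < CR:                     # one-point crossover
                cut = rng.randint(1, n_genes - 1)
                a, b = a[:cut] + b[cut:], b[:cut] + a[cut:]
            for child in (a, b):
                if rng.random() < MR:                 # redraw one random gene
                    child[rng.randrange(n_genes)] = rng.randint(0, 1)
                children.append(child)
        population = children[:N_POP]
    return best_per_generation

history = evolve(toy_fitness)
print(history[0], history[-1])
```

Because the elite is copied unchanged, the best fitness recorded per generation can never decrease; in the
real setting each fitness call would train and validate one CNN, which is what makes larger Npop and Ngen
expensive.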

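Table 2. Hyperparameter design for each experiment

| Group | Hyperparameter | EX1 | EX2 | EX3 | EX4 | EX5 | EX6 | EX7 |
|---|---|---|---|---|---|---|---|---|
| Global | NE | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Global | BS | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Global | OP | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Global | LR | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Global | MO | | | | | | ✓ | ✓ |
| Layer | NF | | ✓* | ✓** | ✓* | ✓** | ✓* | ✓** |
| Layer | KIC | | | | ✓* | ✓** | ✓* | ✓** |
| Layer | WCC | | | | ✓* | ✓** | ✓* | ✓** |
| Layer | NN | | ✓* | ✓** | ✓* | ✓** | ✓* | ✓** |
| Layer | AFD | | | | ✓* | ✓** | ✓* | ✓** |
| Layer | KID | | | | ✓* | ✓** | ✓* | ✓** |
| Layer | WCD | | | | ✓* | ✓** | ✓* | ✓** |
| Layer | DR | | | | | | ✓ | ✓ |
| Layer | PS | | | | | | ✓* | ✓** |
| Layer | KIO | | | | | | ✓ | ✓ |
| Architecture | NC | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Architecture | ND | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| Hyperparameter numbers | | 6 | 10 | 10 | 15 | 15 | 20 | 20 |
| Optimized hyperparameter numbers | | 6 | 10 | 66 | 15 | 141 | 20 | 159 |

Notes: *) All layers with the same value, **) All layers with different values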

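The gap between the two count rows of the table above can be reproduced with simple arithmetic. The sketch
below assumes that the six global and architecture hyperparameters of EX1 are single-valued and that every
layer hyperparameter marked with two stars contributes one gene per layer up to the maximum of 15 layers;
the function name `optimized_count` is hypothetical. Experiments 6 and 7 also include hyperparameters with
other gene counts and are not modelled here.

```python
MAX_LAYERS = 15  # upper bound of NC and ND in Table 1

def optimized_count(n_single, n_layerwise, max_layers=MAX_LAYERS):
    """Single-valued hyperparameters plus one gene per layer for layer-wise ones."""
    return n_single + n_layerwise * max_layers

ex2 = optimized_count(n_single=10, n_layerwise=0)  # 10 hyperparameters, same value in all layers
ex3 = optimized_count(n_single=6, n_layerwise=4)   # 4 of the 10 vary per layer
ex4 = optimized_count(n_single=15, n_layerwise=0)
ex5 = optimized_count(n_single=6, n_layerwise=9)   # 9 of the 15 vary per layer
print(ex2, ex3, ex4, ex5)  # 10 66 15 141
```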

This study used accuracy as the evaluation metric in all experiments. Figure 5 shows the minimum and
maximum accuracy in all experiments. The figure shows that accuracy increases as the population size
(Npop) increases. Experiment 3 (EX3) has higher accuracy than the other experiments. Meanwhile, Figure 6
shows the mean and standard deviation of accuracy for CNN-GA. Figure 6 is in line with Figure 5: a larger
population size results in better average performance. In addition, the individuals in each generation
produce almost the same accuracy value. This means that the accuracy results are tightly clustered, so
each population has a small standard deviation. EX3 again outperforms the other experiments. Figure 7
shows the average distance (AvgDis) and execution time in all experiments.

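Per-generation summaries of this kind can be computed as sketched below. The accuracy values are made-up
placeholders rather than results of this study, the function name `summarize` is hypothetical, and the
population standard deviation is an assumption, since the variant used is not stated.

```python
from statistics import mean, pstdev

def summarize(accuracies_by_generation):
    """Minimum, maximum, mean, and standard deviation of accuracy per generation."""
    return [{"min": min(values), "max": max(values),
             "mean": mean(values), "std": pstdev(values)}
            for values in accuracies_by_generation]

# Placeholder accuracies for two generations of three individuals each.
toy_accuracies = [[0.90, 0.92, 0.94], [0.93, 0.93, 0.93]]
summary = summarize(toy_accuracies)
print(summary)
```

A generation whose individuals all reach nearly the same accuracy, like the second toy generation, yields
a standard deviation close to zero.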
The smaller the AvgDis of each population, the better the resulting accuracy. Figure 7 shows that EX1
and EX3 have a stable AvgDis, but EX3 performs better than EX1. The experiments that achieved better
accuracy also required a longer time. This is in accordance with previous studies showing that producing
optimal hyperparameters using GA requires more time. Accordingly, EX3 took longer than the other
experiments.

As Figures 5-7 show, EX3 is the best of the experiments, with an accuracy of 93.77% and a total
execution time of 10 hours 22 minutes. These results indicate that GA-based hyperparameter optimization can
produce the best accuracy when a long execution time is available. The comparison of the best accuracy and
total execution time is shown in Figure 8.

Furthermore, this paper compares the proposed model with CNN+RS, as shown in Figure 9. CNN+RS was
likewise run in seven experiments with the same list and range of hyperparameters as CNN+GA. This study set
the number of iterations to 150 for CNN+RS. The best accuracy results show that CNN+GA is superior to
CNN+RS in all experiments. Meanwhile, the total execution time of CNN+GA is longer than that of CNN+RS in
all experiments. The comparison of the best accuracy and total execution time of the two models is shown
in Figure 8. Besides RS, this study executed GS by optimizing only the six hyperparameters from EX1. This
choice was made because only EX1 could cover all 525 CNNs evaluated. If all hyperparameters were optimized,
at least 1,048,576 CNNs would be required. GS could not evaluate that many CNNs, since training CNNs is
known to be time-consuming [32]. Therefore, this study used the list and range of hyperparameters for GS
presented in Table 3, similar to [17].

Figure 5. Minimum and maximum of accuracy on CNN-GA

Figure 7. The average distance and time execution of accuracy on CNN-GA

Figure 9. The comparison of CNN+GA and CNN+RS

Table 3. Hyperparameter value sets of GS [17]

| Hyperparameter | Set of values |
|---|---|
| NE | [20, 40, 60, 80, 100] |
| BS | [64, 256] |
| OP | ['Adam', 'Adamax', 'Nadam', 'Adagrad', 'RmsProp'] |
| LR | [1e-3, 1e-2] |
| NCL | [5, 10, 15] |
| NDL | [5, 10, 15] |

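The contrast with random search can be made concrete with a short sketch. It is illustrative only: the
objective is a hypothetical stand-in for training a CNN, the value sets of Table 3 are reused purely as a
compact example space (CNN+RS in this study drew from the same ranges as CNN+GA), and the last line shows
one reading of the lower bound quoted above, namely two candidate values for each of the 20
hyperparameters.

```python
import random

def random_search(objective, space, iterations=150, seed=0):
    """Evaluate independently sampled configurations and keep the best one."""
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    for _ in range(iterations):
        # Each draw ignores all earlier results: there is no learning between samples.
        config = {name: rng.choice(values) for name, values in space.items()}
        score = objective(config)
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score

SPACE = {  # value sets copied from Table 3
    "NE": [20, 40, 60, 80, 100],
    "BS": [64, 256],
    "OP": ["Adam", "Adamax", "Nadam", "Adagrad", "RmsProp"],
    "LR": [1e-3, 1e-2],
    "NCL": [5, 10, 15],
    "NDL": [5, 10, 15],
}

def toy_objective(config):
    """Hypothetical score, not a real accuracy: best at NE = 60 and NCL = 10."""
    return -abs(config["NE"] - 60) - abs(config["NCL"] - 10)

best_config, best_score = random_search(toy_objective, SPACE)

# Assumed reading of the lower bound: 20 hyperparameters with two values each.
min_grid_size = 2 ** 20
print(best_config, best_score, min_grid_size)
```

Unlike the GA, each sampled configuration here is independent of the previous ones, which keeps the run
short but gives the search no way to build on good solutions.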

The comparison of the three models shows that CNN+GA is superior to CNN+GS and CNN+RS in terms of
accuracy. The comparison results can be seen in Figure 10. This result is in line with the research by
Fatyanosa, which suggests that GA is superior to GS and RS in optimizing CNN hyperparameters on image and
text datasets.


Figure 10. The comparison of three models (x-axis: model, CNN+RS, CNN+GA, CNN+GS; y-axis: best accuracy)

The architecture of the best model is shown in Figure 11. It consists of four convolution layers and
four dense layers. This deep architecture can learn a wide variety of handwritten characters and is suited
to learning handwritten datasets.

Figure 11. The best individual GA-CNN in EX3


4. CONCLUSION 

This paper describes an automated approach to optimizing 20 CNN hyperparameters using GA (CNN+GA) on
an English handwritten dataset. The results of the seven experiments show that a larger number of
hyperparameters and layer-specific hyperparameters are only sometimes important, as the use of max-norm
weight constraint hyperparameters does not significantly impact the search process. The standard CNN
architecture produces the best performance compared with the CNN with the complete set of hyperparameters.
The best result in this study is achieved by EX3, with an accuracy of 97.77%.